NSF PAR Search | NSF Public Access Repository


Structured in Space, Randomized in Time: Leveraging Dropout in RNNs for Efficient Training

Sarma, Anup; Singh, Sonali; Jiang, Huaipan; Zhang, Rui; Kandemir, Mahmut; Das, Chita. (January 2022, Advances in neural information processing systems)

Full Text Available
Exploiting Activation based Gradient Output Sparsity to Accelerate Backpropagation in CNNs

Sarma, Anup; Singh, Sonali; Jiang, Huaipan; Pattnaik, Ashutosh; Mishra, Asit K.; Narayanan, Vijaykrishnan; Kandemir, Mahmut T.; Das, Chita R. (September 2021, arXiv.org)

Full Text Available
CASH: compiler assisted hardware design for improving DRAM energy efficiency in CNN inference

https://doi.org/10.1145/3357526.3357536

Sarma, Anup; Jiang, Huaipan; Pattnaik, Ashutosh; Kotra, Jagadish; Kandemir, Mahmut Taylan; Das, Chita R. (January 2019, International Symposium on Memory Systems)

The advent of machine learning (ML) and deep learning applications has led to the development of a multitude of hardware accelerators and architectural optimization techniques for parallel architectures. This is due in part to the regularity and parallelism exhibited by ML workloads, especially convolutional neural networks (CNNs). However, CPUs remain one of the dominant compute fabrics in datacenters today and are therefore also widely deployed for inference tasks. As CNNs grow larger, the inherent limitations of a CPU-based system become apparent, specifically in terms of main memory data movement. In this paper, we present CASH, a compiler-assisted hardware solution that eliminates redundant data movement to and from main memory and therefore reduces main memory bandwidth and energy consumption. Our experimental evaluations on a set of four state-of-the-art CNN workloads indicate that CASH provides, on average, ~40% and ~18% reductions in main memory bandwidth and energy consumption, respectively.
Full Text Available
Controlled Kernel Launch for Dynamic Parallelism in GPUs

https://doi.org/10.1109/HPCA.2017.14

Tang, Xulong; Pattnaik, Ashutosh; Jiang, Huaipan; Kayiran, Onur; Jog, Adwait; Pai, Sreepathi; Ibrahim, Mohamed; Kandemir, Mahmut T.; Das, Chita R. (February 2017, IEEE International Symposium on High Performance Computer Architecture (HPCA))

Dynamic parallelism (DP) is a promising GPU feature that allows on-demand spawning of kernels on the GPU without any CPU intervention. However, this feature has two major drawbacks. First, launching GPU kernels can incur significant performance penalties. Second, dynamically generated kernels are not always able to utilize the GPU cores efficiently because of hardware limits. To address these two concerns cohesively, we propose SPAWN, a runtime framework that controls the dynamically generated kernels, thereby directly reducing the associated launch overheads and queuing latency. Moreover, it allows a better mix of dynamically generated and original (parent) kernels, so the scheduler can effectively hide the remaining overheads and improve the utilization of GPU resources. Our results show that, across 13 benchmarks, SPAWN achieves 69% and 57% speedups over the flat (non-DP) implementation and baseline DP, respectively.
Full Text Available
